Graphic processor based accelerator system and method

ABSTRACT

An accelerator system is implemented on an expansion card comprising a printed circuit board having (a) one or more graphics processing units (GPUs), (b) two or more associated memory banks (logically or physically partitioned), (c) a specialized controller, and (d) a local bus providing signal coupling compatible with the PCI industry standards. The controller handles most of the primitive operations needed to set up and control GPU computation. Thus, the computer's central processing unit (CPU) can be dedicated to other tasks. In this case a few controls (simulation start and stop signals from the CPU and the simulation completion signal back to the CPU), GPU programs, and input/output data are exchanged between the CPU and the expansion card. Moreover, since on every time step of the simulation the results from the previous time step are used but not changed, the results are preferably transferred back to the CPU in parallel with the computation.

RELATED APPLICATIONS

The present application is a broadening reissue application of U.S. Pat. No. 9,189,828, filed Jan. 3, 2014, which claims a priority benefit, under 35 U.S.C. §120, as a continuation of U.S. application Ser. No. 11/860,254, now U.S. Pat. No. 8,648,867 B2, filed Sep. 24, 2007, entitled “Graphic Processor Based Accelerator System and Method,” which in turn claims the priority benefit, under 35 U.S.C. §119(e), of U.S. Application No. 60/826,892, filed Sep. 25, 2006. Each of the above-identified applications is incorporated herein by reference in its entirety. More than one reissue application has been filed for the reissue of U.S. Pat. No. 9,189,828, including this application and a reissue continuation application filed Dec. 29, 2020.

BACKGROUND

Graphics Processing Units (GPUs) are found in video adapters (graphic cards) of most personal computers (PCs), video game consoles, workstations, etc., and are considered highly parallel processors dedicated to fast computation of graphical content. With the advances of the computer and console gaming industries, the need for efficient manipulation and display of 3D graphics has accelerated the development of GPUs.

In addition, manufacturers of GPUs have included general purpose programmability into the GPU architecture, leading to the increased popularity of using GPUs for highly parallelizable and computationally expensive algorithms outside of the computer graphics domain. When implemented on conventional video card architectures, these general purpose GPU (GPGPU) applications are not able to achieve optimal performance, however. There is overhead for graphics-related features and algorithms that are not necessary for these non-video applications.

SUMMARY

Numerical simulations, e.g., finite element analysis, of large systems of similar elements (e.g., neural networks, genetic algorithms, particle systems, mechanical systems) are one example of an application that can benefit from GPGPU computation. During numerical simulations, disk and user input/output can be performed independently of computation because these two processes require interactions with peripheral hardware (disk, screen, keyboard, mouse, etc.) and put relatively low load on the central processing unit/system (CPU). Complete independence is not desirable, however; user input might affect how the computation is performed and even interrupt it if necessary. Furthermore, the user output and the disk output are dependent on the results of the computation. A reasonable solution would be to separate input/output into threads, so that its interaction with hardware occurs in parallel with the computation. In this case, whatever CPU processing is required for input/output should be designed so that it provides the synchronization with computation.

In the case of GPGPU, the computation itself is performed outside of the CPU, so the complete system comprises three “peripheral” components: user interactive hardware, disk hardware, and computational hardware. The central processing unit (CPU) establishes communication and synchronization between peripherals. Each of the peripherals is preferably controlled by a dedicated thread that is executed in parallel with minimal interactions and dependencies on the other threads.

A GPU on a conventional video card is usually controlled through OpenGL, DirectX, or similar graphic application programming interfaces (APIs). Such APIs establish the context of graphic operations, within which all calls to the GPU are made. This context only works when initialized within the same thread of execution that uses it. As a result, in a preferred embodiment, the context is initialized within a computational thread. This creates complications, however, in the interaction between the user interface thread that changes parameters of simulations and the computational thread that uses these parameters.

A solution as proposed here is an implementation of the computational stream of execution in hardware, so that thread and context initialization are replaced by hardware initialization. This hardware implementation includes an expansion card comprising a printed circuit board having (a) one or more graphics processing units, (b) two or more associated memory banks that are logically or physically partitioned, (c) a specialized controller, and (d) a local bus providing signal coupling compatible with the PCI industry standards (this includes but is not limited to PCI Express, PCI-X, USB 2.0, or functionally similar technologies). The controller handles most of the primitive operations needed to set up and control GPU computation. As a result, the CPU is freed from this function and is dedicated to other tasks. In this case a few controls (simulation start and stop signals from the CPU and the simulation completion signal back to the CPU), GPU programs, and input/output data are the information exchanged between the CPU and the expansion card. Moreover, since on every time step of the simulation the results from the previous time step are used but not changed, the results are preferably transferred back to the CPU in parallel with the computation.

In general, according to one aspect, the invention features a computer system. This system comprises a central processing unit, main memory accessed by the central processing unit, and a video system for driving a video monitor in response to the central processing unit, as is common. The computer system further comprises an accelerator that uses input data from and provides output data to the central processing unit. This accelerator comprises at least one graphics processing unit, accelerator memory for the graphics processing unit, and an accelerator controller that moves the input data into the at least one graphics processing unit and the accelerator memory to generate the output data.

In the preferred embodiment, the central processing unit transfers the input data for a simulation to the accelerator, after which the accelerator executes simulation computations to generate the output data, which is transferred to the central processing unit. Preferably, the accelerator controller dictates an order of execution of instructions to the at least one graphics processing unit. The use of the separate controller enables data transfer during execution, such that the accelerator controller transfers output data from the accelerator memory to main memory of the central processing unit.

In the preferred embodiment, the accelerator controller comprises an interface controller that enables the accelerator to communicate over a bus of the computer system with the central processing unit.

In general, according to another aspect, the invention also features an accelerator system for a computer system, which comprises at least one graphics processing unit, accelerator memory for the graphics processing unit, and an accelerator controller for moving data between the at least one graphics processing unit and the accelerator memory.

In general, according to another aspect, the invention also features a method for performing numerical simulations in a computer system. This method comprises a central processing unit loading input data into an accelerator system from main memory of the central processing unit and an accelerator controller transferring the input data to a graphics processing unit with instructions to be performed on the input data. The accelerator controller then transfers output data generated by the graphics processing unit to the central processing unit.

The above and other features of the invention, including various novel details of construction and combinations of parts, and other advantages, will now be more particularly described with reference to the accompanying drawings and pointed out in the claims. It will be understood that the particular method and device embodying the invention are shown by way of illustration and not as a limitation of the invention. The principles and features of this invention may be employed in various and numerous embodiments without departing from the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

In the accompanying drawings, reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale; emphasis has instead been placed upon illustrating the principles of the invention. Of the drawings:

FIG. 1 is a schematic diagram illustrating a computer system including the GPU accelerator according to an embodiment of the present invention;

FIG. 2 is a block diagram illustrating the architecture for the GPU accelerator according to an embodiment of the present invention;

FIG. 3 is a block/flow diagram illustrating an exemplary implementation of the top level control of the GPU accelerator system;

FIG. 4 is a flow diagram illustrating an exemplary implementation of the bottom level control of the GPU accelerator system that is used to execute the target computation; and

FIG. 5 is an example population of nine computational elements arranged in a 3×3 square and a potential packing scheme for texture pixels, according to an implementation of the present invention.

DETAILED DESCRIPTION

FIG. 1 shows a computer system 100 that has been constructed according to the principles of the present invention.

In more detail, the computer system 100 in one example is a standard personal computer (PC). However, this only serves as an example environment, as computing environment 100 does not necessarily depend on or require any combination of the components that are illustrated and described herein. In fact, there are many other suitable computing environments for this invention, including, but not limited to, workstations, server computers, supercomputers, notebook computers, hand-held electronic devices such as cell phones, mp3 players, or personal digital assistants (PDAs), multiprocessor systems, programmable consumer electronics, networks of any of the above-mentioned computing devices, and distributed computing environments that include any of the above-mentioned computing devices.

In one implementation, the GPU accelerator is implemented as an expansion card 180 that includes connections with the motherboard 110, on which the one or more CPUs 120 are installed along with main, or system, memory 130 and mass/non-volatile data storage 140, such as a hard drive or a redundant array of independent drives (RAID), for the computer system 100. In the current example, the expansion card 180 communicates with the motherboard 110 via a local bus 190. This local bus 190 could be PCI, PCI Express, PCI-X, or any other functionally similar technology (depending upon the availability on the motherboard 110). An external version of the GPU accelerator is also a possible implementation. In this example, the external GPU accelerator is connected to the motherboard 110 through USB 2.0, IEEE 1394 (FireWire), or a similar external/peripheral device interface.

The CPU 120 and the system memory 130 on the motherboard 110 and the mass data storage system 140 are preferably independent of the expansion card 180 and only communicate with each other and the expansion card 180 through the system bus 200 located in the motherboard 110. A system bus 200 in current generations of computers has bandwidths from 3.2 GB/s (Pentium 4 with AGTL+, Athlon XP with EV6) to around 15 GB/s (Xeon Woodcrest with AGTL+, Athlon 64/Opteron with HyperTransport), while the local bus has maximal peak data transfer rates of 4 GB/s (PCI Express x16) or 2 GB/s (PCI-X 2.0). Thus the local bus 190 becomes a bottleneck in the information exchange between the system bus 200 and the expansion card 180. The design of the expansion card and the methods proposed herein minimize the data transfer through the local bus 190 to reduce the effect of this bottleneck.

The system memory 130 is referred to as the main random-access memory (RAM) in the description herein. However, this is not intended to limit the system memory 130 to only RAM technology. Other possible computer storage media include, but are not limited to, ROM, EEPROM, flash memory, or any other memory technology.

In the illustrated example, the GPU accelerator system is implemented on an expansion card 180 on which the one or more GPUs 240 are mounted. It should be noted that the GPU accelerator system GPU 240 is separate from and independent of any GPU on the standard video card 150 or other video driving hardware such as integrated graphics systems. Thus the computations performed on the expansion card 180 do not interfere with graphics display (including but not limited to manipulation and rendering of images).

Various brands of GPU are relevant. Under current technology, suitable examples include GPUs based on the GeForce series from NVIDIA Corporation or the Catalyst series from ATI/Advanced Micro Devices, Inc.

The output to a video monitor 170 is preferably through the video card 150 and not the GPU accelerator system 180. The video card 150 is dedicated to the transfer of graphical information and connects to the motherboard 110 through a local bus 160 that is sometimes physically separate from the local bus 190 that connects the expansion card 180 to the motherboard 110.

FIG. 2 is a block diagram illustrating the general architecture of the GPU accelerator system and specifically the expansion card 180, in which at least one GPU 240 and associated memories 210 and 250 are mounted. Electrical (signal) and mechanical coupling with a local bus 190 provides signal coupling compatible with the PCI industry standards (this includes but is not limited to PCI, PCI-X, PCI Express, or functionally similar technology).

The GPU accelerator further preferably comprises one specifically designed accelerator controller 220. Depending upon the implementation, the accelerator controller 220 is field-programmable gate array (FPGA) logic or a custom-built application-specific integrated circuit (ASIC) chip mounted on the expansion card 180, in mechanical and signal coupling with the GPU 240 and the associated memories 210 and 250. During initial design, a controller can be partially or even fully implemented in software, in one example.

The controller 220 commands the storage and retrieval of arrays of data (on a conventional video card the arrays of data are represented as textures, hence the term ‘texture’ in this document refers to a data array unless specified otherwise, and each element of the texture is a pixel of color information), the execution of GPU programs (on a conventional video card these programs are called shaders, hence the term ‘shader’ in this document refers to a GPU program unless specified otherwise), and data transfer between the system bus 200 and the expansion card 180 through the local bus 190, which allows communication between the main CPU 120, RAM 130, and disk 140.
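By way of illustration only, these primitive operations of the controller 220 can be summarized in an interface sketch such as the following; the names and signatures are illustrative assumptions, not part of the disclosure:

// Illustrative sketch of the controller's primitive operations.
// All names and signatures are assumptions for exposition.
#include <cstddef>
#include <cstdint>

struct AcceleratorControllerInterface {
    // Store or retrieve a data array (texture) in the texture memory bank 250.
    virtual void storeTexture(uint32_t textureId, const float* data, size_t bytes) = 0;
    virtual void loadTexture(uint32_t textureId, float* dest, size_t bytes) = 0;
    // Store a GPU program (shader) in the shader memory bank 210.
    virtual void storeShader(uint32_t shaderId, const char* source) = 0;
    // Command the GPU 240 to execute a stored shader.
    virtual void executeShader(uint32_t shaderId) = 0;
    // Transfer data between the expansion card 180 and main RAM over the local bus 190.
    virtual void transferToHost(uint32_t textureId) = 0;
    virtual ~AcceleratorControllerInterface() {}
};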

Two memory banks 210 and 250 are mounted on the expansion card 180. In some examples, these memory banks are separated in the hardware, as shown, or alternatively implemented as a single, logically partitioned memory component.

The reason to separate the memory into two partitions 210, 250 stems from the nature of the computations to which the GPU accelerator system is applied. The elements of computation (computational elements) are characterized by a single output variable. Such computational elements often include one or more equations. Computational elements are the same or similar within a large population and are computed in parallel. An example of such a population is a layer of neurons in an artificial neural network (ANN), where all neurons are described by the same equation. As a result, some data and most of the algorithms are common to all computational elements within a population, while most of the data and some algorithms are specific for each equation. Thus, one memory, the shader memory bank 210, is used to store the shaders needed for the execution of the required computations and the parameters that are common for all computational elements, and is coupled with the controller 220 only. The second memory, the texture memory bank 250, is used to store all the necessary data that are specific for every computational element (including, but not limited to, input data, output data, intermediate results, and parameters) and is coupled with both the controller 220 and the GPU 240.

The texture memory bank 250 is preferably further partitioned into four sections. The first partition 250a is designed to hold the external input data patterns. The second partition 250b is designed to hold the data textures representing internal variables. The third partition 250c is designed to hold the data textures used as input at a particular computation step on the GPU 240. The fourth partition 250d holds the data textures used to accommodate the output of a particular computational step on the GPU 240. This partitioning scheme can be done logically and does not require a hardware implementation. The partitioning scheme can also be altered based on new designs or the needs of the algorithms being employed. The reason for this partitioning is further explained in the Data Organization section, below.
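For exposition only, the four sections can be modeled as a simple enumeration (a minimal sketch; the partitioning is a logical division of the texture memory bank 250 and the names are assumptions):

// Logical sections of the texture memory bank 250 (names are assumptions).
enum TexturePartition {
    PARTITION_EXTERNAL_INPUT = 0, // 250a: external input data patterns
    PARTITION_INTERNAL_VARS  = 1, // 250b: textures for internal variables
    PARTITION_STEP_INPUT     = 2, // 250c: input textures for the current step
    PARTITION_STEP_OUTPUT    = 3  // 250d: output textures for the current step
};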

A local bus interface 230 on the controller 220 serves as a driver that allows the controller 220 to communicate through the local bus 190 with the system bus 200 and thus the CPU 120 and RAM 130. This local bus interface 230 is not intended to be limited to PCI related technology. Other drivers can be used to interface with comparable technology as a local bus 190.

Data Organization

Each computational element discussed above has output variables that affect the rest of the system. For example, in the case of a neural network it is the output of a neuron. A computational element also usually has several internal variables that are used to compute output variables, but are not exposed to the rest of the system, typically not even to other elements of the same population. Each of these variables is represented as a texture. The important difference between output variables and internal variables is their access.

Output variables are usually accessed by any element in the system during every time step. The value of the output variable that is accessed by other elements of the system corresponds to the value computed on the previous, not the current, time step. This is realized by dedicating two textures to output variables: one holds the value computed during the previous time step and is accessible to all computational elements during the current time step; another is not accessible to other elements and is used to accumulate new values for the variable computed during the current time step. In between time steps these two textures are switched, so that newly accumulated values serve as accessible input during the next time step, while the old input is replaced with new values of the variable. This switch is implemented by swapping the address pointers to the respective textures, as described in the System and Framework section.
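A minimal sketch of this double buffering, assuming textures are referenced by integer IDs, is as follows; swapping the two IDs in between time steps replaces a costly texture copy:

#include <utility>

// Double-buffered output variable (sketch; names are assumptions).
struct OutputVariable {
    unsigned readTextureId;  // value computed during the previous time step,
                             // readable by all computational elements
    unsigned writeTextureId; // accumulates the value for the current time step

    // Called in between time steps: newly accumulated values become
    // the accessible input for the next time step.
    void swapBuffers() { std::swap(readTextureId, writeTextureId); }
};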

Internal variables are computed and used within the same computational element. There is no chance of a race condition in which the value is used before it is computed or after it has already changed on the next time step, because within an element the processing is sequential. Therefore, it is possible to render the new value of an internal variable into the same texture in the texture memory bank where the old value was read from. Rendering to more than one texture from a single shader is not implemented in current GPU architectures, so computational elements that track internal variables would have to have one shader per variable. These shaders can be executed in order, with internal variables computed first, followed by output variables.

Further savings of texture memory are achieved through using multiple color components per pixel (texture element) to hold data. Textures can have up to four color components that are all processed in parallel on a GPU. Thus, to maximize the use of the GPU architecture it is desirable to pack the data in such a way that all four components are used by the algorithm. Even though each computational element can have multiple variables, designating one texture pixel per element is ineffective because internal variables require one texture and output variables require two textures. Furthermore, different element types have different numbers of variables, and unless this number is precisely a multiple of four, texture memory can be wasted.

A more reasonable packing scheme would be to pack four computational elements into a pixel and have separate textures for every variable associated with each computational element. In this case the packing scheme is identical for all textures, and therefore can be accessed using the same algorithm. Several ways to approach this packing scheme are outlined here. An example population of nine computational elements arranged in a 3×3 square (FIG. 5a) can be packed by element (FIG. 5b), by row (FIG. 5c), or by square (FIG. 5d).

Packing by element (FIG. 5b) means that elements 1,2,3,4 go into the first pixel; 5,6,7,8 go into the second pixel; and 9 goes into the third pixel. This is the most compact scheme, but it is not convenient because the geometrical relationship is not preserved during packing, and its extraction depends on the size of the population.

Packing by row (column; FIG. 5c) means that elements 1,2,3 go into pixel (1,1); 4,5,6 go into pixel (2,1); and 7,8,9 go into pixel (3,1). With this scheme the element's y coordinate in the population is the pixel's y coordinate, while the element's x coordinate in the population is the pixel's x coordinate times four plus the index of the color component. A five by five population in this case will use a 2×5 texture, or 10 pixels. Five of these pixels will only use one out of four components, so it wastes 37.5% of this texture. A 25×1 population will use a 7×1 texture (seven pixels) and will waste about 10.7% of it.

Packing by square (FIG. 5d) means that elements 1,2,4,5 go into pixel (1,1); 3,6 go into pixel (1,2); 7,8 go into pixel (2,1); and 9 goes into pixel (2,2). Both the row and the column of the element are determined from the row (column) of the pixel times two plus the second (first) bit of the color component index. A five by five population in this case will use a 3×3 texture, or 9 pixels. Four of these pixels will only use two out of four components, and one will only use one component, so it wastes about 30.6% of this texture. This is more advantageous than packing by row, since the texture is smaller and the waste is also lower. A 25×1 population on the other hand will use a 13×1 texture (thirteen pixels) and waste >50% of it, which is much worse than packing by row.

In order to eliminate waste altogether, the population should have even dimensions for the square packing, and it should have a number of columns divisible by four for row packing. Theoretically, the chances are approximately equivalent for both of these cases to occur, so the particular task and data sizes should determine which packing scheme is preferable in each individual case.
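The index arithmetic of the two schemes can be summarized in the following sketch (0-based indices here, whereas the description above uses 1-based coordinates; the names are assumptions):

// Maps an element's population coordinates to a texture pixel and a
// color component index, for the two packing schemes described above.
struct PixelSlot {
    int px, py;    // pixel coordinates in the texture
    int component; // color component index, 0..3
};

// Packing by row: four horizontally adjacent elements share one pixel.
PixelSlot packByRow(int ex, int ey) {
    PixelSlot s = { ex / 4, ey, ex % 4 };
    return s;
}

// Packing by square: a 2x2 block of elements shares one pixel; the two
// bits of the component index encode the position within the block.
PixelSlot packBySquare(int ex, int ey) {
    PixelSlot s = { ex / 2, ey / 2, (ey % 2) * 2 + (ex % 2) };
    return s;
}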

The System and Framework

FIG. 3 shows an exemplary implementation of the top level system and method that is used to control the computation. It is a representation of one of several ways in which a system and method for processing numerical techniques can be implemented in the invention described herein, and so the implementation is not intended to be limited to the following description and accompanying figure.

The method presented herein includes two execution streams that run on the CPU 120: the User Interaction Stream 302 and the Data Output Stream 301. These two streams preferably do not interact directly, but depend on the same data accumulated during simulations. They can be implemented as separate threads with shared memory access and executed on different CPUs in the case of a multi-CPU computing environment. The third execution stream, the Computational Stream 303, runs on the GPU accelerator of the expansion card 180 and interacts with the User Interaction Stream 302 through initialization routines and data exchange in between simulations. The Computational Stream 303 interacts with the User Interaction Stream and the Data Output Stream through synchronization procedures during simulations.
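A conceptual sketch of the three streams as software threads follows (an illustration only; in the invention the Computational Stream is implemented in hardware, and the stub bodies here are assumptions):

#include <thread>

void userInteractionStream() { /* 302: GUI, parameter editing, start/stop */ }
void dataOutputStream()      { /* 301: disk output of accumulated results */ }
void computationalStream()   { /* 303: runs on the GPU accelerator 180 */ }

int main() {
    std::thread ui(userInteractionStream);
    std::thread out(dataOutputStream);
    std::thread compute(computationalStream);
    ui.join();
    out.join();
    compute.join();
    return 0;
}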

The crucial feature of the interaction between the User Interaction Stream 302 and the Computational Stream 303 is the shift of priorities. Outside of the simulation, the system 100 is driven by the user input; thus the User Interaction Stream 302 has the priority and controls the data exchange 304 between streams. After the user starts the simulation, the Computational Stream 303 takes the priority and controls the data exchange between streams until the simulation is finished or interrupted 350.

The user starts 300 the framework through the means of an operating system and interacts with the software through the user interaction section 305 of the graphic user interface 306 executed on the CPU 120. The start 300 of the implementation begins with a user action that causes a GUI initialization 307 and disk input/output initialization 308 on the CPU 120, and a controller initialization 320 of the GPU accelerator on the expansion card 180. GUI initialization includes opening the main application window and setting the interface tools that allow the user to control the framework. Disk I/O initialization can be performed at the start of the framework, or at the start of each individual simulation.

The user interaction 305 controls the setting and editing of the computational elements, parameters, and sources of external inputs. It specifies which equations should have their output saved to disk and/or displayed on the screen. It allows the user to start and stop the simulation. And it performs standard interface functions such as file loading and saving, interactive help, general preferences, and others.

The user interaction 305 directs the CPU 120 to acquire the new external input textures needed (this includes but is not limited to loading them from disk 140 or receiving them in real time from a recording device), parse them if necessary 309, and initialize their transfer to the expansion card 180, where they are stored 325 in the texture memory bank 250 by the controller 220. The user interaction 305 also directs the CPU 120 to parse populations of elements that will be used in the simulation, convert them to GPU programs (shaders), compile them 310, and initialize their transfer to the expansion card 180, where they are stored 326 in the shader memory bank 210 by the controller 220. This operation is accompanied by the upload 309 of the initial data into the input partition of the texture memory bank 250, and stores the shader order of execution in the controller 220. The user can perform operations 309 and 310 as many times as necessary prior to starting the simulation or between simulations.

The editing of the system between simulations is difficult to accomplish without the hardware implementation of the computational thread suggested herein. The system of equations (computational elements) is represented by textures that track variables plus shaders that define processing algorithms. As mentioned above, textures, shaders, and other graphics related constructs can only be initialized within the rendering context, which is thread specific. Therefore textures and shaders can only be initialized in the computational thread.

Network editing is a user-interactive process, which according to the scheme suggested above happens in the User Interaction Stream 302. The simulation software thus has to take the new parameters from the User Interaction Stream 302, communicate them to the Computational Stream 303, and regenerate the necessary shaders and textures. This is hard to accomplish without a hardware implementation of the Computational Stream 303. The Computational Stream 303 is forked from the User Interaction Stream and it can access the memory of the parent thread, but the reverse communication is harder to achieve. The controller 220 allows operations 309 and 310 to be performed as many times as necessary by providing the necessary communication to the User Interaction Stream 302.

After the input parser texture generation 309 and the population parser shader generator and compiler 310 are executed at least once, the user has the option to initialize the simulation 311. During this initialization the main control of the framework is transferred to the GPU accelerator system's accelerator controller 220 and computation 330 is started (see FIG. 4; 420). The user retains the ability to interrupt the simulation, change the input, or change the display properties of the framework, but these interactions are queued to be performed at times determined by the controller-driven data exchange 314 and 316 to avoid the corruption of the data.

The progress monitor 312 is not necessary for performance, but adds convenience. It displays the percentage of completed time steps of the simulation and allows the user to plan the schedule using the estimates of the simulation wall clock times. Controller-driven data exchange 314 updates the display of the results 313. Online screen output for the user selected population allows the user to monitor the activity and evaluate the qualitative behavior of the network. Simulations with unsatisfactory behavior can be terminated early to change parameters and restart. Controller-driven data exchange 314 also drives the output of the results to disk 317. Data output to disk can, for convenience, be done on an element-per-file basis. A suggested file format includes a leftmost column that displays a simulated time for each of the simulation steps and subsequent columns that display variable values during this time step in all elements with identical equations (e.g., all neurons in a layer of a neural network).
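By way of illustration, a per-element output file in the suggested format might look like the following (the values are invented for this example):

# time    neuron(1,1)  neuron(1,2)  neuron(1,3)
0.00      0.000000     0.000000     0.000000
0.05      0.013200     0.009700     0.011500
0.10      0.026100     0.019300     0.022800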

The controller-driven data exchange or input parser texture generator 316 allows the user to change input that is generated on the fly during the simulation. This allows the framework to monitor input that is coming from a recording device (video camera, microphone, cell recording electrode, etc.) in real time. Similar to the initial input parser 309, it preprocesses the input into a universal format of the data array suitable for texture generation and generates textures. Unlike the initial parser 309, here the textures are transferred to hardware not whenever ready but upon the request of the controller 220.

The controller 220 also drives the conditional testing 315 and 318 that informs the CPU-bound streams whether the simulation is finished. If so, the control returns to the User Interaction Stream. The user then can change parameters or inputs (309 and 310), restart the simulation (311), or quit the framework (390).

SANNDRA (Synchronous Artificial Neuronal Network Distributed Runtime Algorithm; http://www.kinness.net/Docs/SANNDRA/html) was developed to accelerate and optimize processing of numerical integration of large non-homogeneous systems of differential equations. This library was fully reworked in its version 2.x.x to support multiple computational backends, including those based on multicore CPUs, GPUs, and other processing systems. The GPU-based backend for SANNDRA 2.x.x can serve as an example practical software implementation of the method and architecture described above and pictorially represented in FIG. 3.

To use SANNDRA, the application should create a TSimulator object either directly or through inheritance. This object will handle global simulation properties and control the User Interaction Stream, Data Output Stream, and Computational Stream. Through TSimulator::timestep(), TSimulator::outfileInterval(), and TSimulator::outmode(), the application can set the time step of the simulation, the time step of disk output, and the mode of the disk output. The external input pattern should be packed into a TPattern object and bound to the simulation object through the TSimulator::resetInputs() method. TSimulator::simLength() sets the length of the simulation.

The second step is to create at least one population of equations (a TPopulation object). A population holds one equation object, TEquation. This object contains only a formula and does not hold element-specific data, so all elements of the population can share a single TEquation.

The TEquation object is converted to a GPU program before execution. GPU programs have to be executed within a graphical context, which is stream specific. TSimulator creates this context within the Computational Stream; therefore all programs and data arrays that are necessary for computation have to be initialized within the Computational Stream. The constructor of TPopulation is called from the User Interaction Stream, so no GPU-related objects can be initialized in this constructor.

TPopulation::fillElements() is a virtual method designed to overcome this difficulty. It is called from within the Computational Stream after TSimulator::networkCreate() is called in the User Interaction Stream. A user has to override TPopulation::fillElements() to create TEquation and other computation related objects, both element-independent and element-specific. Element-independent objects include subcomponents of TEquation and objects that describe how to handle interdependencies between variables, implemented through derivatives of the TGate class.

Element-specific data is held in TElement objects. These objects hold references to TEquation and a set of TGate objects. There is one TElement per population, but the size of the data arrays within this object corresponds to the population size. All TElement objects have to be added to the TSimulator list of elements by calling the TSimulator::addUnit() method from TPopulation::fillElements().

Finally, TPopulation::fillElements() should contain a set of TElement::add*Dependency() calls for each element. Each of these calls sets a corresponding dependency for every TGate object. Here the TGate object holds the element-independent part of the dependency and TElement::add*Dependency() sets the element-specific details.

The system-provided TPopulation handles the output of computational elements, both when they need to exchange the data and when they need to output it to disk. A user implementation of a TPopulation derivative can add screen output.

Listing 1 is example code of a user program that uses a recurrent competitive field (RCF) equation:

Listing 1

#include <cstdint>
#include <cstdlib>
#include <iostream>
// SANNDRA library headers assumed here; the exact header name is not given in the source.

uint16_t w = 3, h = 3;
static float m_compet = 0.5;
static float m_persist = 1.0;

class TCablePopRCF : public TPopulation {
    TEq_RCF* m_equation;
    TGate* m_gate1;
    TGate* m_gate2;

    void createGatingStructure() {
        m_gate1 = new TGate(0);
        m_gate2 = new TGate(1);
    }

    void createUnitStructure(TBasicUnit* u) {
        u->addO2OPInputDependency(m_gate1, 0., 0., 0.004, 0., 0, 0);
        u->addFullDependency(m_gate2, population());
    }

public:
    TCablePopRCF() : TPopulation("compCPU RCF", w, h, true) { }

    ~TCablePopRCF() {
        if(m_equation) delete m_equation;
        if(m_gate1) delete m_gate1;
        if(m_gate2) delete m_gate2;
    }

    bool fillElements(TSimulator* sim);
};

bool TCablePopRCF::fillElements(TSimulator* sim) {
    m_equation = new TEq_RCF(this, m_compet, m_persist);
    createGatingStructure();
    for(size_t i = 0; i < xSize(); ++i)
        for(size_t j = 0; j < ySize(); ++j) {
            TElement* u = new TCPUElement(this, m_equation, i, j);
            sim->addUnit(u);
            createUnitStructure(u);
        }
    return true;
}

int main() {
    // Input pattern generation (309 in FIG. 3)
    uint32_t* pat = new uint32_t[w*h];
    TRandom<float> randGen(0);
    for(uint32_t i = 0; i < w*h; ++i)
        pat[i] = randGen.random();
    TPattern* p = new TPattern(pat, w, h);

    // Setting up the simulation
    TSimulator* cableSim = new TSimulator("data"); // (308 and 320 in FIG. 3)
    cableSim->timestep(0.05);                      // (320 in FIG. 3)
    cableSim->resetInputs(p);                      // (325 in FIG. 3)
    cableSim->outfileInterval(0.1);                // (308 in FIG. 3)
    cableSim->outmode(SANNDRA::timefunc);          // (308 in FIG. 3)
    cableSim->simLength(60.0);                     // (320 in FIG. 3)

    // Preparing the population
    TPopulation* cablePop = new TCablePopRCF();    // (310 in FIG. 3)
    cableSim->networkCreate();                     // (326 in FIG. 3)

    uint16_t user = 1;
    while(user) {
        if(!cableSim->simulationStart(true, 1))    // (311 in FIG. 3)
            exit(1);
        std::cout << "Repeat?\n";                  // (305 in FIG. 3)
        std::cin >> user;                          // (305 in FIG. 3)
        if(user == 1)
            cableSim->networkReset();              // (305 in FIG. 3)
    }
    if(cableSim)
        delete cableSim; // Also deletes cablePop and its internals
    exit(0);
}

FIG. 4 is a detailed flow diagram illustrating a part of an exemplary implementation of the bottom level system and method performed during the computation on the GPU accelerator of the expansion card 180, and is a more detailed view of the computational box 330 in FIG. 3. FIG. 4 is a representation of one of several ways in which a system and method for processing numerical techniques can be implemented.

With systems of equations that have complex interdependencies, it is likely that the value of a variable in some equation from a previous time step has to be used by some other equation after the new values of this variable are already computed for the new time step. To avoid data confusion, the new values of variables should be rendered into a separate texture. After the time step is completed for all equations, these new values should be copied over the old values so that they are used as input during the next time step. Copying textures is a computationally expensive operation, but since the textures are referred to by texture IDs (pointers), swapping these pointers for input and output textures after each time step achieves the same result at a much lesser cost.

In the hardware solution suggested herein, ID swapping is equivalent to swapping the base memory addresses of two partitions of the texture memory bank 250. They are swapped 485 during synchronization (485, 430, and 455) so that the computation 435-487 proceeds immediately and in parallel with the data transfer 445, as shown in FIG. 4. A hardware solution allows this parallelism through the access of the controller 220 to the onboard texture memory bank 250.

The main computation and data exchange are executed by the controller 220. It runs three parallel substreams of execution: the Computational Substream 403, the Data Output Substream 402, and the Data Input Substream 404. These streams are synchronized with each other during the swap of pointers 485 to the input and output texture memory partitions of the texture memory bank 250 and the check for the last iteration 487. Algorithmically, these two operations are a single atomic operation, but the block diagram shows them as two separate blocks for clarity.

The Computational Substream 403 performs a computational cycle including a sequential execution of all shaders that were stored in the shader memory bank 210, using the appropriate input and output textures. To begin the simulation the controller 220 initializes the three execution substreams 403, 402, and 404. On every simulation step, the Computational Substream 403 determines which textures the GPU 240 will need to perform the computations and initiates the upload 435 of them onto the GPU 240. The GPU 240 can communicate directly with the texture memory bank 250 to upload the appropriate texture to perform the computations. The controller 220 also pulls the first shader (known by the stored order) from the shader memory bank 210 and uploads 450 it onto the GPU 240.
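One computational cycle of the Computational Substream 403 can be sketched as follows; the helper routines are assumptions standing in for the controller operations 435, 450, 470, and 480:

#include <cstdint>
#include <vector>

// Assumed helper routines implemented by the controller 220.
void uploadInputTextures(uint32_t shaderId) { /* 435: textures the GPU needs */ }
void uploadShader(uint32_t shaderId)        { /* 450: shader in stored order */ }
void executeOnGpu(uint32_t shaderId)        { /* 470: GPU runs the shader   */ }
void storeOutputTextures(uint32_t shaderId) { /* 480: outputs to bank 250   */ }

// Sequential execution of all stored shaders for one simulation step.
void computationalCycle(const std::vector<uint32_t>& shaderOrder) {
    for (uint32_t id : shaderOrder) { // 482: cycle through all equations
        uploadInputTextures(id);
        uploadShader(id);
        executeOnGpu(id);
        storeOutputTextures(id);
    }
}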

The GPU 240 executes the following operations in this order: it performs the computation (execution of the shader) 470; tells the controller 220 that it is done with the computations for the current shader; and, after all shaders for this particular equation are executed, sends 480 the output textures to the output portion of the texture memory bank 250. This cycle continues through all of the equations based on the branching step 482.

An example shader that performs fourth-order Runge-Kutta numerical integration is shown in Listing 2 using GLSL notation:

Listing 2

uniform sampler2DRect Variable;
uniform float integration_step;

// define equation() here

vec4 rungekutta4(vec4 x)
{
    float halfstep = integration_step*0.5;
    float f1_6step = integration_step/6.0;
    vec4 k1 = equation(x);
    vec4 k2 = equation(x + halfstep*k1);
    vec4 k3 = equation(x + halfstep*k2);
    vec4 k4 = equation(x + integration_step*k3);
    return f1_6step*(k1 + 2.0*(k2 + k3) + k4);
}

void main(void)
{
    // fetch the variable's current value (one pixel per fragment)
    vec4 state = texture2DRect(Variable, gl_TexCoord[0].st);
    state += rungekutta4(state);
    gl_FragColor = state;
}

The shader in Listing 2 can be executed on a conventional video card. Using the controller 220, however, this code can be further optimized. Since the integration step does not change during the simulation, the step itself as well as the halfstep and ⅙ of the step can be computed once per simulation and updated in all shaders by the shader update procedures 310, 326 discussed above.
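A sketch of this optimization follows, in which the host precomputes the step constants once per simulation and bakes them into the shader source before it is stored 326 in the shader memory bank 210 (the function name is an assumption):

#include <sstream>
#include <string>

// Generates a constant header that can substitute for the uniform
// declaration and the per-fragment halfstep / f1_6step computations in
// Listing 2; called once per simulation.
std::string makeIntegratorHeader(float step) {
    std::ostringstream src;
    src << "const float integration_step = " << step << ";\n"
        << "const float halfstep = "         << step * 0.5f << ";\n"
        << "const float f1_6step = "         << step / 6.0f << ";\n";
    return src.str(); // prepended to the shader body of Listing 2
}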

After all of the equations in the computational cycle are computed, the main execution substream 403 on the controller 220 can switch 485 the reference pointers of the input and output portions of the texture memory bank 250.

The two other substreams of execution on the controller 220 wait (blocks 430 and 455, respectively) for this switch to begin their execution. The Data Input Substream 404 controls 440 the input of additional data from the CPU 120. This is necessary in cases where the simulation is monitoring changing input, for example input from a video camera or other recording device in real time. This substream uploads new external input from the CPU 120 to the texture memory bank 250 so it can be used by the main computational substream 403 on the next computational step, and waits for the next iteration 475. The Data Output Substream 402 controls 445 the output of simulation results to the CPU 120 if requested by the user. This substream uploads the results of the previous step to the main RAM 130 so that the CPU 120 can save them on disk 140 or show them on the results display 313, and waits for the next iteration 460.

Since the Computational Substream 403 determines the timing of the input 440 and output 445 data transfers, these data transfers are driven by the controller 220. To further reduce the data transfer overhead (and the disk 140 overhead also), the controller 220 initiates a transfer only after selected computational steps. For example, if the experimental data that is simulated was recorded every 10 milliseconds (msec) and the simulation for better precision was computed every 1 msec, then only every tenth result has to be transferred to match the experimental frequency.
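This decimation can be sketched as follows (a minimal illustration; the interval and routine names are assumptions):

// With a 1 msec compute step and data recorded every 10 msec, transfer
// only every tenth result to match the experimental frequency.
const unsigned kTransferInterval = 10; // compute steps per transferred sample

void onStepCompleted(unsigned step) {
    if (step % kTransferInterval == 0) {
        // initiate output transfer to main RAM 130 (hypothetical call)
        // transferOutputToHost();
    }
}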

This solution stores two copies of the output data, one in the expansion card texture memory bank 250 and another in the system RAM 130. The copy in the system RAM 130 is accessed twice: for disk I/O and for screen visualization 313. An alternative solution would be to provide the CPU 120 with direct read access to the onboard texture memory bank 250 by mapping the memory of the hardware onto a global memory space. The alternative solution would double the communication through the local bus 190. Since the goal discussed herein is reducing the information transfer through the local bus 190, the former solution is favored.

The main substream 403 determines if this is the last iteration 487. If it is the last iteration, the controller 220 waits for all of the execution substreams to finish 490 and then returns the control to the CPU 120; otherwise it begins the next computational cycle.

This repeats through all of the computational cycles of the simulation.

CONCLUSION

This GPU accelerator system offers the following potential advantages:

1. Limited computations on the CPU 120. The CPU 120 is only used for user input, sending information to the controller 220, receiving output after each computational cycle (or less frequently, as defined by the user), writing this output to disk 140, and displaying this output on the monitor 170. This frees the CPU 120 to execute other applications and allows the expansion card to run at its full capacity without being slowed down by extensive interactions with the CPU 120.

2. Minimizing data transfer between the expansion card 180 and the system bus 200. All of the information needed to perform the simulations will be stored on the expansion card 180 and all simulations will take place on it. Furthermore, whatever data transfer remains necessary will take place in parallel with the computation, thus reducing the impact of this transfer on the performance.

3. New way to execute GPU programs (shaders). Previously, the CPU 120 had full control over the order of shader execution and was required to produce specific commands on every cycle to tell the GPU 240 which shader to use. With the invention disclosed herein, shaders will initially be stored in the shader memory bank 210 on the expansion card 180 and will be sent to the GPU 240 for execution by the general purpose controller 220 located on the expansion card.

4. Multiple parallelisms. The GPU 240 is inherently parallel and is well suited to perform parallel computations. In parallel with the GPU 240 performing the next calculation, the controller 220 is uploading the data from the previous calculation into main memory 130. Furthermore, the CPU 120 at the same time uses the uploaded previous results to save them onto disk 140 and to display them on the screen through the system bus 200.

5. Reuse of existing and affordable technology. All hardware used in the invention and mentioned herein is based on currently available and reliable components. Further advances of these components will provide straightforward improvements of the invention.

While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.

What is claimed is:
1. A computer system, comprising: a central processing unit to receive input data; main memory, operably coupled to the central processing unit via a bus, to store the input data received by the central processing unit; an accelerator, operably coupled to the central processing unit and the main memory via the bus, to receive at least a portion of the input data from the main memory, the accelerator comprising: at least one graphics processing unit to perform a sequence of computations on the at least a portion of the input data so as to generate output data, the sequence of computations representing an artificial neural network, intermediate computations in the sequence of computations representing respective layers of the artificial neural network and yielding intermediate results; and accelerator memory, operably coupled to the at least one graphics processing unit, to store the results of the sequence of computations; and a controller, operably coupled to the at least one graphics processing unit and the accelerator memory, to initialize textures and shaders in the accelerator memory for performing the sequence of computations, to control performance of the sequence of computations by the at least one graphics processing unit, to transfer the at least a portion of the input data into the accelerator memory during performance of the intermediate computations in the sequence of computations by the at least one graphics processing unit, and to transfer at least a portion of the output data from the accelerator memory to the main memory during performance of the intermediate computations in the sequence of computations by the at least one graphics processing unit.
2. The computer system of claim 1, wherein the central processing unit is configured to receive the input data in response to a user interaction.
3. The computer system of claim 1, wherein: the central processing unit is configured to receive the input data at a first rate; and the at least one graphics processing unit is configured to perform the sequence of computations at a second rate different than the first rate.
4. The computer system of claim 1, wherein the main memory is configured to store a copy of the output data stored in the accelerator memory.
5. The computer system of claim 1, wherein an output of at least one computation in the sequence of computations represents an output of at least one neuron in an artificial neural network.
6. The computer system of claim 1, wherein the accelerator memory comprises: a first memory bank to store parameters common to all of the computations in the sequence of computations; and a second memory bank to store data specific to at least one computation in the sequence of computations.
7. The computer system of claim 1, wherein the controller is configured to transfer the output data from the accelerator memory to the main memory without transferring any of the intermediate results from the accelerator memory to the main memory so as to reduce data transfer via the bus.
8. The computer system of claim 1, wherein the controller is configured to transfer at least a portion of the output data from the accelerator memory to the main memory after the at least one graphics processing unit has begun to perform another sequence of computations.
9. The computer system of claim 8, wherein the controller is configured to initiate transfer of the at least a portion of the input data and to transfer the at least a portion of the output data in parallel with performance of at least one computation in the other sequence of computations by the at least one graphics processing unit.
10. The computer system of claim 1, wherein the controller is configured to control execution of the sequence of computations by the at least one graphics processing unit.
11. The computer system of claim 1, further comprising: at least one of a video camera, a microphone, or a cell recording electrode, operably coupled to the central processing unit, to acquire the input data in real time.
12. A method of performing a sequence of computations representing an artificial neural network on a computer system comprising a central processing unit (CPU), a main memory operably coupled to the central processing unit via a bus, and an accelerator operably coupled to the CPU and the main memory via the bus, the accelerator comprising a graphics processing unit (GPU) and an accelerator memory, the method comprising: (A) performing, by the GPU, the sequence of computations on a first portion of the input data so as to generate a first portion of the output data, the first portion of the output data representing an output of a neuron in a first layer of the artificial neural network, intermediate computations in the sequence of computations yielding intermediate results, wherein performing the sequence of computations on the first portion of the input data comprises (i) assigning an output variable to a first texture and a second texture, the output variable being included in a first computational element of a plurality of computational elements, the plurality of computational elements representing the sequence of computations, and (ii) accumulating a first value for the output variable in the first texture during a first time step; (B) in parallel with performing the sequence of computations by the GPU in (A), transferring a second portion of the input data from the main memory to the accelerator via the bus; (C) in parallel with performing the sequence of computations by the GPU in (A), transferring a second portion of the output data from the accelerator memory to the main memory via the bus, the second portion of the output data representing an output of a neuron in a second layer in the artificial neural network; and (D) performing, by the GPU, the sequence of computations on the second portion of the input data, wherein performing the sequence of computations on the second portion of the input data comprises (i) accumulating a second value for the output variable in the second texture during a second time step and (ii) making the first value of the output variable in the first texture accessible to other computational elements in the plurality of computational elements during the second time step.
13. The method of claim 12, further comprising: storing the input data in the main memory in response to a user interaction.
14. The method of claim 12, further comprising: receiving the input data at a first rate; and wherein (A) comprises performing the sequence of computations at a second rate different than the first rate.
15. The method of claim 12, wherein (A) comprises: generating an output representative of an output of at least one neuron in an artificial neural network.
16. The method of claim 12, wherein (C) comprises: transferring the second portion of the output data from the accelerator memory to the main memory without transferring any of the intermediate results of the sequence of computations from the accelerator memory to the main memory so as to reduce data transfer via the bus.

17. The method of claim 12, wherein (C) comprises: transferring the second portion of the output data from the accelerator memory to the main memory after the GPU has begun to perform another sequence of computations.
18. The method of claim 17, wherein (C) further comprises: initiating transfer of the second portion of the output data in parallel with performance of at least one computation in the other sequence of computations.
19. The method of claim 12, further comprising: acquiring the input data in real time with at least one of a video camera, a microphone, or a cell recording electrode operably coupled to the CPU.

20. The method of claim 12, further comprising: storing parameters common to all of the computations in the sequence of computations in a first memory bank in the accelerator memory; and storing data specific to at least one computation in the sequence of computations in a second memory bank in the accelerator memory.
21. A method of performing a sequence of computations representing an artificial neural network, the method comprising: receiving, at a central processing unit (CPU), first input data acquired from an external system in real time; initializing, by a controller operably coupled to a graphics processing unit (GPU), textures and shaders in a memory operably coupled to the GPU; transferring the first input data received by the CPU to the memory operably coupled to the GPU; performing, by the graphics processing unit (GPU), a first computation in the sequence of computations on the first input data based on the textures and shaders to generate first output data, computations in the sequence of computations representing respective layers of neurons in the artificial neural network, an output of the first computation in the sequence of computations representing an output of a first neuron in a first layer in the artificial neural network; storing, in the memory operably coupled to the GPU, the first input data and the first output data; and transferring second input data acquired from the external system in real time into the memory operably coupled to the GPU after the GPU starts the first computation and before the GPU starts a second computation of the sequence of computations, an output of the second computation in the sequence of computations representing an output of a second neuron in a second layer in the artificial neural network.
22. The method of claim 21, wherein transferring the second input data comprises transferring the second input data via a bus operably coupled to the CPU.
23. The method of claim 21, further comprising: transferring the first output data from the memory to another memory during the second computation in the sequence of computations.
24. The method of claim 23, further comprising: storing intermediate results of the sequence of computations in the memory, and wherein transferring the first output data from the memory to the other memory occurs without transferring the intermediate results of the sequence of computations.
25. The method of claim 23, wherein transferring the second input data and transferring the first output data occur in parallel.
26. The method of claim 21, further comprising: storing, in a first memory partition of the memory, parameters common to all of the computations in the sequence of computations.
 27. The method of claim 26, further comprising: storing,in a second memory partition of the memory, data specific to the firstcomputation in the sequence of computations.
 28. The method of claim 27,further comprising: storing, in the second memory partition, externalinput data patterns, representations of internal variables, an input ofthe computation in the sequence of computations, and the output of thecomputation in the sequence of computations.
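Claims 26-28 split the accelerator memory into a partition of parameters shared by every computation and a partition of per-computation data. A minimal C++-style sketch of one such layout (the struct and field names are illustrative assumptions, not claim language):

    // Partition 1: parameters common to all computations in the
    // sequence (e.g., synaptic weights reused across every step).
    struct CommonParams {
        float* weights;      // read-only for the whole sequence
        int    numNeurons;
    };

    // Partition 2: data specific to one computation -- external input
    // patterns, internal variables, and that computation's own
    // input and output, per claim 28.
    struct StepData {
        float* externalInput;
        float* internalVars;
        float* stepInput;
        float* stepOutput;
    };

Keeping the shared, read-only bank separate lets every computation in the sequence reuse it without transferring it over the bus again.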
29. The method of claim 21, wherein storing the first output data comprises: accumulating, in the memory, outputs of computational elements executed by the GPU in performing the first computation in the sequence of computations.

30. The method of claim 21, further comprising: storing, in the memory, an output of a previous computation in the sequence of computations; and accessing, by the GPU, the output of the previous computation during performance of the computation in the sequence of computations.
31. The method of claim 21, wherein performing the first computation comprises executing a plurality of computational elements representing a layer of neurons in an artificial neural network.

32. The method of claim 31, wherein all neurons in the layer of neurons are described by the same equation.

33. The method of claim 21, further comprising: acquiring the second input data with at least one of a video camera, a microphone, or a cell recording electrode.

34. The method of claim 21, further comprising: loading the second input data from disk.
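Claim 21 stages the next frame of real-time input into GPU memory after the first computation starts and before the second begins. A hedged sketch of that pipeline using CUDA streams in place of the claimed controller and texture setup; acquire(), the kernel, and the buffer size are assumptions for illustration only:

    #include <cuda_runtime.h>

    // Stand-in for a real-time source (camera, microphone, recording
    // electrode); purely illustrative.
    static const float* acquire() {
        static float frame[1024] = {0};
        return frame;
    }

    // Placeholder for one layer of identical neurons (cf. claim 32).
    __global__ void layerKernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tanhf(in[i]);
    }

    void twoLayerPipeline() {
        const int n = 1024;
        float *d_in0, *d_in1, *d_out;
        cudaMalloc(&d_in0, n * sizeof(float));
        cudaMalloc(&d_in1, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        cudaStream_t compute, upload;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&upload);

        // First input in; the first computation (first layer) starts.
        cudaMemcpyAsync(d_in0, acquire(), n * sizeof(float),
                        cudaMemcpyHostToDevice, compute);
        layerKernel<<<(n + 255) / 256, 256, 0, compute>>>(d_in0, d_out, n);

        // Second input is staged after the first computation starts
        // and before the second one begins; d_in1 now holds the next
        // sample, ready for its own pass.
        cudaMemcpyAsync(d_in1, acquire(), n * sizeof(float),
                        cudaMemcpyHostToDevice, upload);
        cudaStreamSynchronize(upload);

        // Second computation (second layer) consumes the first's output.
        layerKernel<<<(n + 255) / 256, 256, 0, compute>>>(d_out, d_in0, n);
        cudaStreamSynchronize(compute);
        // (Cleanup of buffers and streams omitted.)
    }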
35. A system for performing a sequence of computations, the system comprising: a camera to generate input data in real time; a first memory partition; a second memory partition operably coupled to the first memory partition; and a processing unit, operably coupled to the camera, the first memory partition, and the second memory partition, to perform the sequence of computations on a first portion of the input data so as to generate a first portion of output data, intermediate computations in the sequence of computations yielding intermediate results, the first portion of the output data representing an output of an artificial neural network, wherein the first memory partition is configured to transfer a second portion of the input data to the second memory partition in parallel with performance of the sequence of computations by the processing unit, wherein the second memory partition is configured to transfer a second portion of the output data to the first memory partition in parallel with performance of the sequence of computations by the processing unit, and wherein the sequence of computations represents the artificial neural network, each neuron in the artificial neural network has an output variable assigned to a first texture and a second texture in the memory, the first texture holds a first value of the output variable computed during a previous time step of the sequence of computations and accessible to other neurons in the neural network during a current time step of the sequence of computations, and the second texture accumulates a second value of the output variable computed during the current time step.
36. The system of claim 35, wherein the first memory partition and the second memory partition are logical partitions.
37. The system of claim 35, wherein the processing unit comprises a graphics processing unit (GPU).

38. The system of claim 35, wherein the processing unit is configured to receive the input data at a first rate and to perform the sequence of computations at a second rate different than the first rate.
39. The system of claim 35, wherein the second memory partition is configured to transfer the second portion of the output data to the first memory partition without transferring any of the intermediate results to the first memory partition.
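Claim 35 assigns each neuron's output variable a pair of textures: one holding the previous step's value, readable by the other neurons, and one accumulating the current step's value. The same ping-pong scheme is sketched below with plain CUDA device buffers standing in for the claimed texture pair (buffer and kernel names are illustrative assumptions):

    #include <cuda_runtime.h>
    #include <utility>

    // prev holds outputs from the previous time step (read-only here);
    // curr accumulates outputs computed during the current time step.
    __global__ void stepNeurons(const float* prev, float* curr, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float input = prev[(i + 1) % n];  // read another neuron's old output
        curr[i] = tanhf(input);           // placeholder neuron equation
    }

    void simulate(float* d_prev, float* d_curr, int n, int steps) {
        for (int t = 0; t < steps; ++t) {
            stepNeurons<<<(n + 255) / 256, 256>>>(d_prev, d_curr, n);
            cudaDeviceSynchronize();
            // The current step's values become the "previous" values
            // that every neuron may read during the next step.
            std::swap(d_prev, d_curr);
        }
    }

Because the previous step's values are used but never changed during the current step, the prev buffer can also be copied off the accelerator in parallel with the kernel, which is exactly what claims 35 and 39 exploit.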
40. A system for executing an artificial neural network, the system comprising: a central processing unit (CPU) to provide first input data; a memory, operably coupled to the CPU, to store the first input data in a first partition, referenced by a first pointer, before computing a first layer of neurons of the artificial neural network; a processing unit, operably coupled to the memory, to perform, during computation of the first layer of neurons, at least one calculation on the first input data so as to generate first output data, the first output data representing an output of at least one neuron in the first layer of neurons; and a controller, operably coupled to the processing unit and the memory, to: store the first output data in a second partition of the memory, the second partition referenced by a second pointer, and to swap the first pointer with the second pointer at the end of the computation of the first layer of neurons, such that the first output data becomes an input for a second layer of neurons of the artificial neural network, transfer the first output data to another memory during computation of the second layer of neurons, and dictate an order of execution of instructions to the processing unit to perform the computation of the first layer of neurons.

41. The system of claim 40, wherein the processing unit comprises a graphics processing unit.

42. The system of claim 40, wherein the controller is configured to send instructions for performing the at least one calculation to the processing unit.
43. The system of claim 40, wherein the memory further comprises: a third partition to store internal variables; and a fourth partition to store data used as input at a particular layer of neurons of the artificial neural network.

44. A computer system, comprising: a central processing unit to receive input data acquired from an external system; main memory, operably coupled to the central processing unit via a bus, to store the input data received by the central processing unit; an accelerator, operably coupled to the central processing unit and the main memory via the bus, to receive at least a portion of the input data from the main memory, the accelerator comprising: at least one processing unit to perform a sequence of computations representing an artificial neural network on the at least a portion of the input data so as to generate output data, intermediate computations in the sequence of computations representing layers of the neural network and yielding intermediate results; and accelerator memory, operably coupled to the at least one processing unit, to store the results of the sequence of computations; and a controller, operably coupled to the at least one processing unit and the accelerator memory, to control transfer of the at least a portion of the input data into the accelerator memory during performance of the intermediate computations in the sequence of computations by the at least one processing unit, to control transfer of at least a portion of the output data from the accelerator memory to the main memory during performance of the intermediate computations in the sequence of computations by the at least one processing unit, and to control performance of the sequence of computations by the at least one processing unit.
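Claims 40-44 above describe the controller swapping the input and output pointers at the end of each layer, so one layer's output becomes the next layer's input, while that output simultaneously drains to another memory. A minimal two-layer sketch of the pointer swap with CUDA streams standing in for the claimed controller (kernel, stream, and buffer names are assumptions):

    #include <cuda_runtime.h>
    #include <utility>

    // Placeholder for the calculation performed on one layer of neurons.
    __global__ void layerKernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tanhf(in[i]);
    }

    void runTwoLayers(float* d_in, float* d_out, float* h_layer1, int n,
                      cudaStream_t compute, cudaStream_t copy) {
        // First layer computes into the partition behind the second pointer.
        layerKernel<<<(n + 255) / 256, 256, 0, compute>>>(d_in, d_out, n);
        cudaStreamSynchronize(compute);

        // Pointer swap: the first layer's output becomes the second
        // layer's input, with no data movement on the accelerator.
        std::swap(d_in, d_out);

        // The first layer's output drains to the other memory while the
        // second layer is being computed.
        cudaMemcpyAsync(h_layer1, d_in, n * sizeof(float),
                        cudaMemcpyDeviceToHost, copy);
        layerKernel<<<(n + 255) / 256, 256, 0, compute>>>(d_in, d_out, n);

        cudaStreamSynchronize(compute);
        cudaStreamSynchronize(copy);
    }

The swap costs two pointer assignments instead of a buffer copy, which is why the claimed controller performs it at every layer boundary.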
45. The computer system of claim 44, wherein the central processing unit is configured to receive the input data in response to a user interaction.

46. The computer system of claim 44, wherein: the central processing unit is configured to receive the input data at a first rate; and the at least one processing unit is configured to perform the sequence of computations at a second rate different than the first rate.

47. The computer system of claim 44, wherein the main memory is configured to store a copy of the output data stored in the accelerator memory.

48. The computer system of claim 44, wherein an output of at least one computation in the sequence of computations represents an output of at least one neuron in an artificial neural network.
49. The computer system of claim 44, wherein the accelerator memory comprises: a first memory partition to store parameters common to all of the computations in the sequence of computations; and a second memory partition to store data specific to at least one computation in the sequence of computations.
50. The computer system of claim 44, wherein the controller is configured to transfer the output data from the accelerator memory to the main memory without transferring any of the intermediate results from the accelerator memory to the main memory so as to reduce data transfer via the bus.

51. The computer system of claim 44, wherein the controller is configured to transfer at least a portion of the output data from the accelerator memory to the main memory after the at least one processing unit has begun to perform another sequence of computations.

52. The computer system of claim 51, wherein the controller is configured to initiate transfer of the at least a portion of the input data and to transfer the at least a portion of the output data in parallel with performance of at least one computation in the other sequence of computations by the at least one processing unit.

53. The computer system of claim 44, wherein the controller is configured to control execution of the sequence of computations by the at least one processing unit.

54. The computer system of claim 44, further comprising: at least one of a video camera, a microphone, or a cell recording electrode, operably coupled to the central processing unit, to acquire the input data in real time.
55. The computer system of claim 1, wherein the controller is configured to inform the central processing unit that the sequence of computations is finished.

56. The computer system of claim 1, wherein the controller is configured to reduce a processing load on the central processing unit.

57. The computer system of claim 1, wherein the controller is configured to reduce interactions between the central processing unit and the accelerator.
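Claims 55-57 make the controller, not the CPU, track completion, so the host stays free for other work and host/accelerator interaction is minimized. A loose modern analogue (an assumption, not the claimed controller) is a CUDA event the CPU polls between unrelated tasks; the two helper functions below are hypothetical stand-ins:

    #include <cuda_runtime.h>

    // Hypothetical stand-ins: the sequence's kernels would be enqueued
    // in the first, and unrelated host-side work done in the second.
    static void launchSequenceOfComputations(cudaStream_t) {}
    static void doOtherCpuWork() {}

    void waitForAccelerator() {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cudaEvent_t done;
        cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

        launchSequenceOfComputations(stream);
        cudaEventRecord(done, stream);   // the "finished" signal

        // The CPU stays busy elsewhere and only polls the signal,
        // keeping interactions with the accelerator to a minimum.
        while (cudaEventQuery(done) == cudaErrorNotReady) {
            doOtherCpuWork();
        }

        cudaEventDestroy(done);
        cudaStreamDestroy(stream);
    }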